instrument development from the history of Project Follow Through,
suggesting that the value of an instrument may be overlooked because
the instrument is judged by criteria inappropriate to the original 
motivations behind its development effort. Section 4 attempts to go 
beyond the statement of the problem to suggest how thinking of a test 
as a source of individual learning might guide test development in 
nontraditional ways. Section 5 sums up some of the possible 
connections between testing and various social functions, pointing to
some alternate ways in which standardized testing may serve goals of
evaluation. (RH)

I. INTRODUCTION

How can standardized tests be better developed to improve educational
program evaluation? This question is the subject of this paper. I should
hasten to make clear that I have no ready-made answers to this question.
Rather I have some suggestions about ways of approaching the question,
approaches which lead, I think, toward some rather uncommon
formulations of the possible relationships between standardized testing
and educational evaluation.
By way of introduction I should explain what I mean by standardized
test. I use the phrase standardized test in a fairly general sense to
mean a systematic device for eliciting and recording a sampling of
skills, knowledge, or attitudes. In this definition, I include such
commonly recognized tests as aptitude and achievement tests, and norm-
and criterion-referenced tests, and techniques such as systematic
observation and rating instruments, but exclude, at least for the sake
of this discussion, teacher-made or classroom tests.

Specifically, my initial thesis in this paper is that educational
tests are typically developed in terms of two functions traditionally
assumed of educational tests, namely selection and formal inference,
but that these functions may not fit very well with some of the current
social functions of educational testing. Educational tests, for example,
seem to be serving more and more nowadays as a medium of communication,
for discussion and debate over the goals and priorities of schooling.
Indeed the minimum competency testing movement of recent years could be
viewed as a social conversation on what should be the main aims of
elementary and secondary schooling. For instance, when legislatures and
special committees debate whether high school graduation tests should
cover "school skills" or "life skills," they are implicitly debating
alternative aims of schooling.

A second example is that tests sometimes serve as social standards.
Indeed, tests are widely perceived as devices for upholding educational
standards, as for instance when they are viewed as antidotes to grade
inflation or instruments for adding meaning to the high school diploma.
To some extent, of course, standardized tests already do serve as social
standards. Indeed the notion is implicit in the phrase standardized
tests. But note that if one set out to develop a test as a social or
educational standard, it might not be necessary to employ the traditional
techniques of test development.

A third example is that children learn directly from tests. Students
may, of course, learn indirectly as a result of tests in any number of
ways, because of college admissions decisions based on test results or
through teaching based on test results. But what I would like to explore
is how individuals might learn directly from tests and test results, and
how tests might be developed differently if this were one's aim.

To explore these issues, this paper is organized as follows. Section
II describes the disjuncture to which I alluded above, namely that
tests developed in light of the functions of selection and inference may
not well serve other functions. To make this thesis clearer, section
III will recount an example of instrument development from the history
of Project Follow Through to suggest that the value of an instrument
may have been overlooked because it was judged by criteria
inappropriate to the original motivations behind the instrument
development effort. Section IV attempts to go beyond the problem
outlined to suggest how thinking of a test for a particular function,
namely as a source of individual learning, might guide test development
in ways somewhat different than those suggested by traditional standards
of test development. The closing section, V, sums up some of the possible
connections between testing and different social functions, and points to
some alternative ways in which standardized testing may serve goals
of evaluation.

II. THE PROBLEM

The thesis outlined above was that traditional methods of constructing
standardized tests are relevant to only some of the social functions which
tests serve. To make this point clearer let me briefly describe some of the
considerations which typically guide the construction of standardized tests.

Norm-referenced tests of achievement, aptitude and ability constitute
the thickest branch in the family tree of standardized testing. The history
of norm-referenced tests clearly suggests the success of such tests in
informing selection processes. The original Binet test was designed, of
course, to select French school children for special instruction because
they could not profit from regular instruction. In the tremendous
proliferation of testing in the first World War, the Army Alpha and Beta
tests were used for military personnel selection. And the Scholastic
Aptitude Test, introduced originally in 1926 and adapted into essentially
its current form in the 1930s, is probably the preeminent example of a
norm-referenced standardized test serving selection functions.

The tie between norm-referenced testing and the function of
selection is apparent not just in historical perspective, but also
in the techniques used to construct norm-referenced tests (NRT).
Item difficulty and item-test correlations, for example, are two of the
most widely used criteria in terms of which candidate items are selected
for inclusion in norm-referenced tests. Also, of course, constructors of
NRTs must pay heed to item content specifications, but as the technical
report on the SAT notes, content specifications are "necessarily less
rigorous" than difficulty and item-test correlations (Angoff, 1971, p. 9).
Now in terms of unitary selection decisions, these criteria contribute to
important overall test characteristics. Difficulty contributes to the
test's power to discriminate among test takers, an important
characteristic of a selection test, since practical selection decisions
are almost always constrained in that some candidates must be selected,
but not all can be. Similarly, item-test correlations contribute to the
construct coherence of the selection instrument. If one is faced with a
binary selection decision, that is, to select or not, such an attribute
surely can make matters simpler than if a selection instrument tapped
several different constructs.
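
The two item-selection criteria named above can be illustrated with a
short computation. What follows is only a minimal sketch, not the
procedure of any particular test publisher: item difficulty is taken as
the proportion of test takers answering an item correctly, the item-test
correlation is the plain Pearson correlation between the 0/1 item score
and the total score (with the item itself left in the total), and the
function names and cut-off values are hypothetical choices made for the
illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either sequence has no variance
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """responses: one list of 0/1 item scores per test taker."""
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        stats.append({
            "item": j,
            # Proportion answering correctly (higher means easier).
            "difficulty": sum(item) / len(item),
            # Uncorrected item-total correlation: the item is part of the total.
            "item_total_r": pearson(item, totals),
        })
    return stats


def select_items(stats, min_p=0.3, max_p=0.7, min_r=0.3):
    """Keep items of middling difficulty that correlate with the total.

    The three thresholds are hypothetical, chosen only for illustration.
    """
    return [
        s["item"]
        for s in stats
        if min_p <= s["difficulty"] <= max_p and s["item_total_r"] >= min_r
    ]
```

Under rules of this kind an item that nearly everyone answers correctly,
or one that runs against the total score, is dropped regardless of its
content, which is the sense in which content specifications end up less
rigorous than the statistical criteria.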

Nevertheless, desirable though these test characteristics may be
from a selection perspective, critics of NRT have noted in recent years
that these characteristics may not be desirable, or may even be
undesirable, in light of other functions that tests may serve. Indeed, it
is thinking along this line which has powered much interest in
criterion-referenced tests in the last decade or so.

Several observers, for example, have directly criticized the
widespread use of norm-referenced standardized tests in program evaluation
(among others, Glaser, 1963; Carver, 1974; Popham, 1978; Madaus et al.,
1979). The argument, in abbreviated form, goes roughly as follows. Since
norm-referenced tests were designed to serve selection purposes and
hence to discriminate efficiently among individual test takers, they
have been constructed to be insensitive to effects of instruction in
local school systems, which may have different curricula. Now tests
are increasingly being used to evaluate educational programs and to
guide instruction. However, precisely because of the way they are
constructed, norm-referenced tests tend to be insensitive to the
instructional effects of particular educational programs. Hence new
types of tests are required for the purposes of program evaluation.

More extreme critics of norm-referenced tests have extended this
argument; they predict that the weaknesses of norm-referenced tests
will usher in a new period of educational assessment, "the
criterion-referenced measurement era" (Popham, 1978, p. 2, emphasis in
original). More moderate observers have suggested merely that
curriculum-sensitive tests can play an important role in program
evaluation, even though norm-referenced tests may continue to be valuable
for comparisons of the educational outcomes of programs that emphasize
different aspects of instruction (Madaus et al., 1979).

If we are to judge from the continued popularity of norm-referenced
tests, it seems doubtful that the criterion-referenced era is yet upon us.
Nevertheless, there surely is much interest in criterion-referenced
tests (CRT). According to one recent review of the state of the
art of criterion-referenced measurement, so much has been written on this
topic that we now have available "more than fifty descriptions of a
criterion-referenced test" (Berk, 1980, p. 5). The most widely cited
definition appears to be that of Popham, namely that a CRT "is used to
ascertain an individual's status with respect to a well-defined behavioral
domain" (Popham, 1978, p. 93). Given this definition, it is not surprising
to find it written that the most important step in the development of a
CRT is "to define operationally the domain of content or behaviors the
test is to measure" (Berk, 1980, p. 13).

Yet when one examines recent literature on criterion-referenced
measurement, a curious pattern is apparent. Far more has been written
on technical issues of validity and reliability than on the "most important"
step of defining what it is that a CRT is to measure. In Berk's (1980) book
on the state of the art of criterion-referenced measurement, for example,
the two brief chapters on domain specification/item generation contain a
scant 34 references whereas the bulkier four chapters on validity and
reliability contain over 180 references. In other words, work on
criterion-referenced measurement seems to be progressing far faster on
technical issues such as methods of item analysis, setting cut-off scores,
assessing decision consistency, and applying generalizability theory to
analyze variance in test results, than on the more fundamental issue of
defining directly what it is that a criterion-referenced test is designed
to measure.

Another means of illustrating this contrast is to cite an observation
by Popham in the introductory chapter in the Berk (1980) volume. After
recounting a variety of domain specification strategies that he has
tried, Popham observes in closing:

     Once upon a time when I was younger and foolisher,
     I thought we could create test specifications so
     constraining that the test items produced . . . would be
     functionally homogeneous, that is, essentially
     interchangeable. But if we use the difficulty of an item
     as at least one index of the item's nature, then it
     becomes quite obvious that even in such teensy
     behavior domains as measuring the student's ability to
     multiply pairs of two digit numbers, the task of
     n x 11 is lots easier than [illegible in source]

Popham's observation nicely illustrates one of the essential problems
of criterion-referenced measurement. It is that common constructs in terms
of which we communicate about the substance and skills of learning often seem
to have little coherence in terms of the common coin of educational
measurement: right or wrong answers, item difficulties and test scores.

There may, of course, be strategies for surmounting this apparent
problem, for example through longer tests, multiple measures, or
statistical equating of various sorts. But my focus in this paper is not
on such theoretical problems. Rather, I mean to suggest simply that many
of the important social functions of educational tests may not depend on
issues of formal inference, and that judging test instruments only or
largely in terms of standards of formal inference may limit other social
functions of tests. To illustrate this point I will go on to suggest that
if we view standardized tests not simply as measurement instruments but as
sources of direct learning, then perhaps we might develop them in
different ways.

III. AN EXAMPLE FROM THE HISTORY OF THE NATIONAL FOLLOW THROUGH EVALUATION

To illustrate my thesis that judging test instruments in terms of
techniques relevant to selection and formal inference may hinder their
application for alternative functions, in this section I recount one
small portion of the history of the national evaluation of Project Follow
Through. When Follow Through evaluation results were released in 1977,
there ensued much debate about the narrowness of the outcome measures
used, and the limited scope of the evaluation (House et al., 1978). What
was widely overlooked in the controversy over the FT evaluation results,
however, was that a huge amount of effort was actually invested in
assessing a wide range of the broad goals of FT. Indeed, through 1977 it
was estimated that around $50 million or roughly 10 percent of total FT
program costs were invested in the national evaluation (Haney, 1977, p. 2).
As far as I know this amount far surpasses typical program investment in
evaluation. So if the FT evaluation was overly narrow it was surely not
for want of resource investment in the task.

Now much of what was tried in the FT evaluation died or disappeared
before it ever reached fruition. As I observed in writing a history of
FT, the FT evaluation over time underwent "a sort of funnel vision," with
dozens of questions asked of the evaluation at one time or another falling
by the wayside (Haney, 1977, p. 295). There were several reasons for
the sloughing off of questions in the course of the FT evaluation. I will
not even try to mention most of them here. Nevertheless, one cause
relevant to the present topic was the way in which evaluators went about
developing and judging the quality of evaluation instruments.

To illustrate how this worked, let me briefly recount the history
of parent interview data in the FT evaluation (summarized from Haney, 1977,
pp. 95, 258-269). From the very inception of FT, official program
documents stressed the importance of involving parents in the program.
Indeed, when official rules and regulations for the program were finally
promulgated in 1977, one of seven explicitly stated evaluation criteria
for FT was the "extent of parent involvement." Given this emphasis, it is
not surprising that considerable attention was given, as early as 1968, to
interviewing parents of FT children, in part to obtain data on their
involvement in the FT program. Between 1968 and 1975, over 60,000 parents
were interviewed by the National Opinion Research Center to gather data for
the national evaluation. Yet by the time of the final Abt evaluation report
of FT, these data, gathered at tremendous expense, had almost completely
disappeared from view. They were not even mentioned in the final "patterns
of effects" chapter in the main volume of the final Abt report, nor in the
Abt digest of evaluation findings.

There were several reasons for the virtual disappearance of the
parent data, including ambiguity of purpose behind their gathering,
organizational discontinuities, and simply too many demands on
evaluators and too little time and resources to respond fully to all that
different parties wanted done. But beyond such practical problems
lay another cause, namely how evaluators went about analyzing the
parent interview data, and assessing what they measured. Over the four
years of the Abt evaluation effort, a variety of factor and cluster
analyses were performed on the parent interview data. Now those techniques
are widely recognized means of developing tests and understanding the
meaning of test data, by identifying the constructs measured by
data. But the problem which arose in applying these techniques to
FT parent interview data was that from one year to the next, the
results never turned out quite the same. Successively the Abt
evaluators derived six clusters one year, eight clusters the second year,
ten factors the third, and thirteen factors in the final, fourth year
of analysis! Although some clusters and factors from the different
years of analysis contained the same parent interview questions, more
often than not corresponding clusters and factors also included
different interview questions. Such discontinuity across years of
analysis quite effectively prevented any comparisons of results across
years. While there are several alternative explanations for the
virtual disappearance of the parent interview data in the national FT
evaluation effort, one is this. The parent interview data gathering
was instituted to gather information on important aspects of the FT
program, but data analysis designed to ascertain what constructs were
represented in the parent interview data revealed that they tapped no
clearly consistent constructs across different years of data gathering.
It is, in a way, Popham's point writ large. Though the parent interview
data had some coherence in terms of what was asked, the results of
interview questions turned out to have little construct coherence in
terms of interview responses.
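
The year-to-year instability described above can be made concrete by
comparing which interview questions fall into each cluster in successive
years. The following is a minimal sketch only, and not the procedure the
Abt analysts used: it assumes the Jaccard index (questions shared by two
clusters divided by all questions appearing in either) as the measure of
overlap, and the function names, cluster names and question labels are
hypothetical.

```python
def jaccard(a, b):
    """Share of questions common to two clusters, out of all questions in either."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty clusters are treated as identical
    return len(a & b) / len(a | b)


def best_matches(year_a, year_b):
    """For each cluster in year_a, find the most similar cluster in year_b.

    year_a, year_b: dicts mapping a cluster name to its interview questions.
    Returns a dict mapping each year_a cluster to (year_b cluster, overlap).
    """
    matches = {}
    for name_a, items_a in year_a.items():
        name_b, score = max(
            ((nb, jaccard(items_a, items_b)) for nb, items_b in year_b.items()),
            key=lambda pair: pair[1],
        )
        matches[name_a] = (name_b, score)
    return matches
```

Overlap values well below 1.0 for even the best-matching clusters would
correspond to the situation recounted here, in which nominally
corresponding clusters contained different questions and so could not be
compared across years.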

This episode illustrates the disjuncture to which I alluded in the
introduction, namely a measurement procedure instituted for one set
of reasons being judged in terms of techniques which imply another
purpose. Specifically the parent interview data gathering was instituted
as a means of responding to one important aspect of FT, but the results
came to be judged in terms of techniques, namely factor and cluster
analysis, aimed essentially at identifying the construct coherence
of parent interview responses. Such coherence is not, however,
necessarily relevant to the original goals motivating the endeavor. Indeed
when the parent interview data were reanalyzed, using simple cross
tabulations, and on a judgmental basis grouping together items relevant to
specific FT goals, it was found that patterns of parental responses
correspond in many cases with precisely what could be expected in terms of
the goals of different FT models (Haney and Pennington, 1978, pp. 103-104).
Such correspondence could not of course be inferred with any great degree
of confidence to be effects of FT model programs; but my point is that such
simpler techniques, oriented more toward description than to inference, may
have been more congruent with the original motivation behind introduction of
parent interviews into the FT evaluation effort.

IV. DEVELOPING TESTS AS INSTRUMENTS FOR LEARNING

If one accepts the proposition that commonly recognized techniques
of test development, including both well-established techniques of NRT
construction and newer prescriptions on CRT construction, may be
counterproductive with respect to functions of tests other than selection
and formal inference, natural next questions are: 1) What other important
social functions do tests serve, and 2) How could tests be developed so as
to enhance those functions? In the introduction I suggested several
different social functions which tests seem to be serving, namely as media
for educational communication, as educational standards and as sources
of learning. I will not try to speculate here on how tests might be
developed differently if aimed at each of these, or other particular
functions. Rather, simply as a way of illustrating my more general point,
I will attempt to suggest what considerations might go into developing
tests as learning instruments.

A reasonable place from which to begin this exploration is simply to
ask what makes for effective learning. Obviously different people have
different answers to the question, but as a means of illustrating this
approach to thinking about test development, let me work with one particular
set of theories of learning, namely Benjamin Bloom's writing on Human
Characteristics and School Learning (1976), and his theory of mastery
learning.

Bloom's theory encompasses the full range of the learning process
including student characteristics, instruction, and learning outcomes.
His observations on each of these areas have implications, I think, for
how one might think about test development. Nevertheless, let me focus
here on instruction, and specifically Bloom's observations on critical
aspects of quality instruction. Bloom suggests that four characteristics
seem to be important: cues, participation, reinforcement, and feedback.
Before elaborating on what Bloom means by these terms let me note simply
that one need not accept Bloom's theory lock, stock and barrel to be
interested in these characteristics. As Bloom himself suggests, these
aspects of quality instruction can be identified in other theories of
learning. Indeed, with respect to the first three, Bloom maintains that
"although the terms may differ, they can be found in some respect in almost
every theory of learning as summarized by Hilgard and Bower (1966)" (Bloom,
1976, p. 172).

So what are these four features of quality instruction? Bloom
describes them mainly in terms of tutor-student learning arrangements,
but since I wish to suggest their broader applicability, I recount Bloom's
description in paraphrase. Having done so, I will proceed to suggest what
they imply for test development if we view tests as learning instruments.

Cues. It is made clear what is to be learned, what the student is
to do, and how he is to do it. Cues can be altered or adapted to present
those which work best for particular learners. For some students the cues
can be derived from written materials; for others it may be oral
explanations; and for still others it may be combinations of
demonstrations or models with explanations, and so forth.

Participation. The learner actively participates or practices the
responses to be learned. While some of this participation may be overt
and observable, it is also likely that covert participation may be as
effective in some situations as the more overt or observable
participation. There may be individual differences in the amount of
practice or participation needed.

Reinforcement. Positive or negative reinforcement is used at various
stages of the learning process. Reinforcers are adapted to the learner
since what is an excellent reward for one student may not operate in the
same way for another. A variety of reinforcers (both extrinsic and
intrinsic) are used.

Feedback. Individual students receive evidence on the effectiveness
of the learning process. Relatively rapid corrective feedback is provided
when and where needed. "Furthermore, through the use of a variety of
instructional materials, students helping each other, or tutors or aides,
mastery learning procedures have made it possible to quickly apply
correctives with regard to cues, participation and reinforcement where the
learners have specific difficulties in the learning process" (pp. 172-173).

Now suppose we accept Bloom's formulation of these aspects as critical
components of an effective learning system. Suppose further that we view
tests not just as measurement devices from which teachers or tutors derive
information to use in applying Bloom's ideas to instruction, but also
as learning instruments from which test-takers might learn directly.
From this perspective and in light of Bloom's critical features of
a learning system, how might tests be developed differently than they
typically are at present? Bloom's advice regarding cues suggests that
tests might be more clearly labelled, not in terms of psychological
constructs or abstract learning domains, but instead in terms more
familiar to student test-takers. The idea of adaptable modes of
presenting cues also might imply alternative means of test presentation;
for example, oral, written and demonstration. When tests are viewed
strictly as measurement devices, such alternative modes might be viewed
as a problem, namely as extraneous sources of error variance. But from
the learning perspective, alternative modes might be viewed more
positively as differentially appropriate for students with different
learning styles.

Bloom's notions of participation seem to imply several alterations
from traditional test development procedures. At a minimum they suggest
less emphasis on external control over administrative conditions and
scoring of results. Test items that are either self-scoring or
scoreable by the student him- or herself would, for example, seem to have
considerable potential for enhancing active learner participation in
assessment. Likewise, the notion that different learners may need
different amounts of participation and practice would suggest that tests
would not necessarily need to be of uniform length for all test-takers.

Bloom's third aspect of quality instruction is reinforcement, either
positive or negative, at various stages in the learning process. He notes
further that what is excellent reinforcement for one student may not operate
in the same way for another student. This suggests that reinforcement
which students derive from tests might best take different forms. For
example, instead of all students receiving overall percentage correct
scores, or some norm-referenced or criterion-referenced score derived
from percentage correct, perhaps instrument scoring procedures could
be adapted so that test-takers could receive results in the form of
item types or sets in which they scored highest (positive reinforcement)
or lowest (negative reinforcement).
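
Reporting by item type rather than by overall percentage correct is easy
to sketch. The following is a minimal illustration under stated
assumptions, not a description of any existing scoring system: each
scored item is assumed to carry a label for its type, the function names
are hypothetical, and ties between types are broken simply by the order
in which the types first appear.

```python
from collections import defaultdict


def results_by_item_type(scored_items):
    """scored_items: iterable of (item_type, is_correct) pairs.

    Returns the proportion answered correctly for each item type.
    """
    right = defaultdict(int)
    seen = defaultdict(int)
    for item_type, is_correct in scored_items:
        seen[item_type] += 1
        if is_correct:
            right[item_type] += 1
    return {item_type: right[item_type] / seen[item_type] for item_type in seen}


def strongest_and_weakest(scored_items):
    """Return the item types with the highest and lowest proportion correct."""
    rates = results_by_item_type(scored_items)
    strongest = max(rates, key=rates.get)
    weakest = min(rates, key=rates.get)
    return strongest, weakest
```

A test-taker would then see, for instance, the item type on which he or
she did best alongside the one most in need of further work, in place of
a single summary score.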

Bloom's recommendations regarding rapid feedback suggest that tests
might be constructed not only so that they are self-scoring or
scoreable by the test-taker him- or herself, but also so that results
convey specific information or cues on types of errors or sources of
information on corrective instruction. With regard to self-scoring, for
example, might it not be possible for tests to employ materials and
techniques already used in instant lottery tickets, so that test-takers
could gain immediate feedback on whether their answers were right or
wrong? Such self-scoring answer sheets have been used as far back as
1935 in the Henmon-Nelson Test of Mental Ability (which used the
Clapp-Young self-marking device patented in 1929) as an aid to test
administrators, but as far as I know such techniques have not been widely
viewed as a potential source of enhancing test-taker participation in
the assessment process.



1. I know of little research bearing directly on the issue of immediate
feedback of test results. One relevant study of computerized adaptive
testing concluded that "testees reacted very favorably to the
provision of knowledge of results" and that this knowledge of results
"increased average testee motivation" (Prestwood, 1978, p. 105).



The only instrument of which I know that has employed such techniques
in this way is the TORQUE developed at the Education Development Center,
but unfortunately this unusual test development effort seems to have
come to a halt before any large-scale try-out and evaluation could be
accomplished.
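
The idea that results might convey cues on types of errors, and not only
whether an answer is right or wrong, can be sketched in a few lines.
This is an illustration under stated assumptions and does not describe
TORQUE or any other existing instrument: the Item class, its error_hints
table and the wording of the feedback are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    prompt: str
    correct: str
    # Hypothetical table mapping an anticipated wrong answer to a corrective hint.
    error_hints: dict = field(default_factory=dict)


def score_response(item, response):
    """Score one response at once and attach corrective feedback."""
    answer = response.strip()
    if answer == item.correct:
        return {"correct": True, "feedback": "Correct."}
    # Fall back to a general message when the error was not anticipated.
    hint = item.error_hints.get(answer, "Not correct; review this item type.")
    return {"correct": False, "feedback": hint}
```

The design choice worth noting is that the diagnostic value lies in the
table of anticipated errors, which has to be written by someone who
knows how learners typically go wrong on the item; nothing in the score
statistics supplies it.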

In short, this brief review of how tests might be developed as
instructional devices, specifically as direct aids to individual
learning, suggests that tests developed with this aim in mind might have
several features which are not now found in most standardized tests.
Specifically, they might

- be available in alternative modes of presentation

- be labelled in terms familiar to test-takers rather than
in terms of psychological constructs or behavioral domains

- not require standardized administration

- be self-scoring or scoreable by individual test-takers

- be of variable length

- provide results not only on whether answers are right or wrong
but on the nature of errors or sources of corrective instruction.

The process of developing tests with such characteristics obviously
would entail less attention to the artifacts of tests, namely the score
results in terms of which the qualities of standardized tests typically
are judged, and more attention to the content of test questions and the
way in which individual test-takers interpret and react to them. It
would, for instance, require something akin to what curriculum developers
call learner verification, and less attention to tests and test items as
strictly measurement devices, to their discriminatory power, and to their
empirical construct coherence.

V. LEARNING, MEASUREMENT AND EVALUATION

These ideas obviously raise the question of whether tests with the
characteristics I have described would really be tests as this term is
commonly understood. After all, standardized tests are more commonly
thought of as instruments of educational measurement than as instruments
of learning, or educational standards, or media of communication. My
answer is yes, for what I have been suggesting is exactly that
standardized tests in the various roles they serve already are not and
need not be viewed simply as measurement instruments.

Why? Because the limits of measurement are quite severe. In arguing
this point, I discount the broader definitions of measurement, for
example S. S. Stevens' view that measurement is simply "the assignment
of numerals to things so as to represent facts and conventions about
them" (Stevens, 1960, p. 148) and Ernest Nagel's sweeping definition
that "measurement can be regarded as the definition of and fixation of
our ideas of things so that the determination of what it is to be a man or to
be a circle is a case of measurement" (196*, p. 121). Instead, I refer more
narrowly to Lyle Jones' definition that "measurement . . . is a
determination of the magnitude of a specified attribute of the object,
organism, or event in terms of a unit of measurement" (1971). Given this
definition, and as long as we discount tautologies of the sort advanced
with respect to intelligence tests, namely that intelligence is what
intelligence tests measure, my point is simply that there is much in
education and many sorts of learning which cannot be measured, whose
magnitudes cannot be determined.

More generally, it seems quite clear that many social functions
of standardized tests are not dependent on their qualities as
measurement devices. This point can be illustrated by referring to the
Eighth Mental Measurements Yearbook (Buros, 1978). As the introduction to
this massive two-volume publication points out, the two most widely cited
test instruments are the Minnesota Multiphasic Personality Inventory and
the Rorschach, each with around 5000 cumulative total references in the
Buros series of publications, while the average number of references for
instruments listed in 8MMY is only 25 or so (Buros, 1978, p. xxxix). Why
should these tests be so widely used? Surely it is not because of their
proven validity and reliability as measurement instruments. As one
reviewer of the Rorschach suggests,

Certainly the validity research on the Rorschach does
not warrant its popularity. Rather it seems it is the
role the Rorschach has played within the psychodynamic
oriented approach to psychopathology that has resulted
in its popularity. Few instruments provide data so
rich with hypothetical, dynamic associations as does
the Rorschach. When the goal of assessment is to formulate
complex personality structures and complex dynamic
interactions as the cause of the observed behavior,
the Rorschach elicits responses which can be multi-interpreted
and combined in an endless set of associations
to produce speculative complex hypotheses and
interpretations.

(Peterson, in Buros, 1978, p. 1042)
If I may offer a uni-interpretation of that passage, it seems as if this
fellow is saying that the Rorschach is popular not because it helps answer
questions, but because it multiplies them. This suggests that standardized
tests, for at least some purposes, are valued not as valid and reliable
measurement instruments per se but because they yield information which can be
interpreted in numerous different ways.

It is an unusual perspective on the value of test information,
but oddly enough it seems not too different from some recent thinking
about program evaluation. Recall that not too many years ago, educational
program evaluation was viewed mainly as applied social science
research in the service of decision-making. Emphasis was on estimating
effects of educational programs, most often by using standardized tests.
But research on the utility of evaluation research has shown that evaluation
findings rarely seem to have contributed directly to decision-making
in the way that was expected (Cohen & Garet, 1975; Weiss, 1977).
Instead, it seems often to be used in a more general way, indirectly
influencing the way in which people think about education and educational
programs. At least partly as a result, many seem now to think of program
evaluation less as applied science and more as a descriptive enterprise,
with more attention given to program implementation and depiction of how
programs operate, even if their effects cannot be confidently measured.
Evaluation as effects measurement is, of course, still alive and well in
some quarters, but we also now have evaluation as investigative reporting,
evaluation as story-telling, and evaluation as art. From this angle a
more general way of making the point of this paper is simply to say
that to the extent that program evaluation has shifted away from the goal
of formal inference of program effects, perhaps testing as part of the
evaluative enterprise should also be aimed less at formal inference and
selection and more at description. Test instruments as vehicles for
communication and sources of direct learning may not, I realize, seem terribly
relevant to conceptions of evaluation as applied research. But such
roles may nevertheless serve the larger meaning of evaluation and its
ultimate goal. For if we take the meaning of evaluation to be ascertaining
values of programs, it is clear that this can never be reduced
strictly to a technical or scientific affair. And if the goal of educational
evaluation is improvement of education, we need not restrict ourselves
to a paradigm by which evaluators produce knowledge to give to
educators for purposes of educational improvement. Perhaps instead we
might view the role of evaluators as providing tools to educators and
society generally with which to communicate about education goals and
values, and as providing instruments to learners to improve learning.



This point should not, however, be overstated. For one of the significant
features of thinking on social science research in recent years is
that it need not, and perhaps should not, strive at building all-powerful
theories and parsimonious generalizations, but instead should attend to
fuller and more thorough descriptions. For example, Cronbach recently
argued:

Social scientists generally, and psychologists in particular,
have modeled their work on physical science, aspiring to
amass empirical generalizations, to restructure them into
more general laws, and to weld scattered laws into coherent
theory. That lofty aspiration is far from realization. . . .
Social scientists are rightly proud of the discipline we
draw from the natural-science side of our ancestry. Scientific
discipline is what we uniquely add to the time-honored
ways of studying man. Too narrow an identification with
science, however, has fixed our eyes upon an inappropriate
goal. The goal of our work, I have argued here, is not to
amass generalizations atop which a theoretical tower can
someday be erected (cf. Scriven, 1959b, p. 471). The special
task of the social scientist in each generation is to pin
down the contemporary facts.

(Cronbach, 1975)